Information-theoretic inference of common ancestors 
Abstract

A directed acyclic graph (DAG) partially represents the conditional independence structure among observations of a system if the local Markov condition holds, that is, if every variable is independent of its non-descendants given its parents. In general, there is a whole class of DAGs that represents a given set of conditional independence relations. We are interested in properties of this class that can be derived from observations of a subsystem only. To this end, we prove an information-theoretic inequality that allows for the inference of common ancestors of observed parts in any DAG representing some unknown larger system. More explicitly, we show that a large amount of dependence in terms of mutual information among the observations implies the existence of a common ancestor that distributes this information. Within the causal interpretation of DAGs, our result can be seen as a quantitative extension of Reichenbach's Principle of Common Cause to more than two variables.
Our conclusions are also valid for non-probabilistic observations such as binary strings, since we state the proof for an axiomatized notion of 'mutual information' that includes the stochastic as well as the algorithmic version.



1 Introduction 



Causal relations among components $X_1, \dots, X_n$ of a system are commonly modeled in terms of a directed acyclic graph (DAG) in which there is an edge $X_i \to X_j$ whenever $X_i$ is a direct cause of $X_j$. Further, it is usually assumed that information about the causal structure can be obtained through interventions in the system. However, there are situations in which interventions are not feasible (too expensive, unethical or physically impossible) and one faces the problem of inferring causal relations from observational data only. To this end, postulates linking observations to the underlying causal structure have been employed, one of the most fundamental being the causal Markov condition [1] [5]. It connects the underlying causal structure to conditional independencies among the observations. Explicitly, it states that every observation is independent of its non-effects given its direct causes. It formalizes the intuition that the only relevant components of a system for a given observation are its direct causes.

In terms of DAGs, the causal Markov condition states that a DAG can only be a valid causal model of a system if every node is independent of its non-descendants given its parents. The graph is then said to fulfill the local Markov condition [3]. Consider for example the causal hypothesis $X \to Y \leftarrow Z$ on three observations $X$, $Y$ and $Z$. Assuming the causal Markov condition, the hypothesis implies that $X$ and $Z$ are independent. Violation of this independence then allows one to exclude this causal hypothesis. But note that in general there are many DAGs that fulfill the local Markov condition with respect to a given set of conditional independence relations. For example, all three DAGs $X \to Y \to Z$, $X \leftarrow Y \to Z$ and $X \leftarrow Y \leftarrow Z$ encode that $X$ is independent of $Z$ given $Y$, and it cannot be decided from information on conditional independences alone which is the true causal model. Nevertheless, properties that are shared by all valid DAGs (e.g. an edge between $X$ and $Y$ in the example) provide information about the underlying causal structure.

Figure 1: Two causal hypotheses for which the causal Markov condition does not imply conditional independencies among the observations $X_1$, $X_2$ and $X_3$. Thus they cannot be distinguished using qualitative criteria like the common cause principle (unobserved variables are indicated as dots). However, the model on the right can be excluded if the dependence among the $X_i$ exceeds a certain bound.
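The distinction drawn above can be checked numerically. The following is a minimal sketch, assuming binary variables and a noise level of 0.1 (both illustrative choices, not taken from the text); the helper names `entropy`, `mi` and `flip` are hypothetical. It builds one distribution for the collider $X \to Y \leftarrow Z$ and one for the chain $X \to Y \to Z$ and compares $I(X : Z)$ with $I(X : Z \mid Y)$.

```python
import itertools
import math
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the coordinates idx of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information I(A : B | C) in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def flip(bit, eps):
    """Pmf of a noisy copy of bit: flipped with probability eps."""
    return {bit: 1 - eps, 1 - bit: eps}

# Collider X -> Y <- Z: X and Z are independent fair bits, Y is a noisy XOR.
collider = defaultdict(float)
for x, z in itertools.product((0, 1), repeat=2):
    for y, py in flip(x ^ z, 0.1).items():
        collider[(x, y, z)] += 0.25 * py

# Chain X -> Y -> Z: Y is a noisy copy of X, Z is a noisy copy of Y.
chain = defaultdict(float)
for x in (0, 1):
    for y, py in flip(x, 0.1).items():
        for z, pz in flip(y, 0.1).items():
            chain[(x, y, z)] += 0.5 * py * pz

X, Y, Z = (0,), (1,), (2,)
print("collider: I(X:Z) =", mi(collider, X, Z), " I(X:Z|Y) =", mi(collider, X, Z, Y))
print("chain:    I(X:Z) =", mi(chain, X, Z), " I(X:Z|Y) =", mi(chain, X, Z, Y))
```

In this sketch the collider gives a vanishing $I(X : Z)$ but a positive $I(X : Z \mid Y)$, while the chain behaves the other way round, so the two hypotheses are distinguishable from independences alone, whereas the three chain-like DAGs are not.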
The causal Markov condition is only expected to hold for a given set of observations if all relevant components of a system have been observed, that is, if there are no confounders (causes of two or more observations that have not been measured). It can then be proven by assuming a functional model of causality [U HI [5]. As an example, consider the observations $X_1, \dots, X_n$ to be jointly distributed random variables. In this case, the causal Markov condition can be derived for a given DAG on $X_1, \dots, X_n$ from two assumptions: (1) every variable $X_i$ is a deterministic function of its parents and an independent (possibly unobserved) noise variable $N_i$, and (2) the noise variables $N_i$ are jointly independent. However, in this paper we assume that our observations provide only partial knowledge about a system and ask for structural properties common to all DAGs that represent the independencies of some larger set of elements.

To motivate our result, assume first that our observation consists of only two jointly distributed random variables $X_1$ and $X_2$ which are stochastically dependent. Reichenbach [6] postulated already in 1956 that the dependence of $X_1$ and $X_2$ needs to be explained by (at least) one of the following cases: $X_1$ is a cause of $X_2$, or $X_2$ is a cause of $X_1$, or there exists a common cause of $X_1$ and $X_2$. This link between dependence and the underlying causal structure is known as Reichenbach's principle of common cause. It is easily seen that if $X_1$ and $X_2$ are assumed to be part of some unknown larger system whose causal structure is described by a DAG $G$, then the causal Markov condition for $G$ implies the principle of common cause. Moreover, we can subsume all three cases of the principle if we formally allow a node to be an ancestor of itself and arrive at the

Common cause principle. If two observations $X_1$ and $X_2$ are dependent, then they must have a common ancestor in any DAG modeling some possibly larger system.

Our main result is an information-theoretic inequality that enables us to generalize this principle to more than two variables. It leads to the

Extended common cause principle (informal version). Consider $n$ observations $X_1, \dots, X_n$, and a number $c$, $1 < c < n$. If the dependence of the observations exceeds a bound that depends on $c$, then in any DAG modeling some possibly larger system there exist $c$ nodes out of $X_1, \dots, X_n$ that have a common ancestor.

Thus, structural information can be obtained by exploiting the degree of dependence on the subsystem. We would like to emphasize that, in contrast to the original common cause principle, the above criterion provides means to distinguish among cases with the same independence structure of the observed variables. This is illustrated in Figure 1.

Above, the extended common cause principle is stated without making explicit the kind of observations we consider and how dependence is quantified. In the main case we have in mind, the observations are jointly distributed random variables and dependence is quantified by the mutual information [7] function. Then the extended common cause principle (Theorem 10) relates stochastic dependence to a property of all Bayesian networks that include the observations.
However, the result holds for more general observations (such as binary strings) and for more general notions of mutual information (such as algorithmic mutual information [8]). Therefore we introduce an 'axiomatized' version of mutual information in the following section and describe how it can be connected to a DAG. Then, in Section 3, we prove a theorem on the decomposition of information about subsets of a DAG, from which the extended common cause principle follows as a corollary. Apart from a larger area of applicability, we think that an abstract proof based on an axiomatized notion of information better illustrates that the result is independent of the notion of 'probability'. It only relies on the basic properties of (stochastic) mutual information (see Definition 1). Finally, in Section 4 we describe the result in more detail within different contexts and relate it to the notion of redundancy and synergy that was introduced in the area of neural information processing.

2 General mutual information and DAGs 

Before introducing a general notion of mutual information, let us describe how it is connected to a DAG in the stochastic setting. Assume we are given an observation of $n$ discrete random variables $X_1, \dots, X_n$ in terms of their joint probability distribution $p(X_1, \dots, X_n)$. Write $[n] = \{1, \dots, n\}$ and for a subset $S \subseteq [n]$ let $X_S$ be the random variable associated with the tuple $(X_i)_{i \in S}$. Assume further that a directed acyclic graph (DAG) $G$ is associated with the nodes $X_1, \dots, X_n$ that fulfills the local Markov condition [3]: for all $i$ ($1 \le i \le n$)

$$X_i \perp\!\!\!\perp X_{\mathrm{nd}_i} \mid X_{\mathrm{pa}_i}, \tag{1}$$

where $\mathrm{nd}_i$ and $\mathrm{pa}_i$ denote the subsets of indices corresponding to the non-descendants and to the parents of $X_i$ in $G$. The tuple $(G, p(X_{[n]}))$ is called a Bayesian net [9] and the conditional independence relations imply the factorization of the joint probability distribution

$$p(x_1, \dots, x_n) = \prod_{i \in [n]} p(x_i \mid x_{\mathrm{pa}_i}),$$

where small letters $x_i$ stand for values taken by the random variables $X_i$. From this factorization it follows that the joint information measured in terms of Shannon entropy [JJ decomposes into a sum of individual conditional entropies

$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{\mathrm{pa}_i}). \tag{2}$$

Shannon entropy can be considered as an absolute measure of information. However, in many cases only a notion of information relative to another observation may be available. For example, in the case of continuous random variables, the (differential) Shannon entropy can be negative and hence may not be a good measure of the information. Therefore we would like to formulate our results based on a relative measure, such as mutual information, which, moreover, induces a notion of independence in a natural way. This can be achieved by introducing a specially designated variable $Y$ relative to which information will be quantified. $Y$ can for example be thought of as providing a noisy measurement of the $X_{[n]}$ (Fig. 2(a)). Then, with respect to a joint probability distribution $p(Y, X_{[n]})$, we can transform the decomposition of entropies into a decomposition of mutual information [7]

$$I(Y : X_{[n]}) \ge \sum_{i=1}^{n} I(Y : X_i \mid X_{\mathrm{pa}_i}). \tag{3}$$
Figure 2: The graph in (a) shows a DAG on nodes $X_1, \dots, X_5$ whose observation is modeled by a leaf node $Y$ (e.g. a noisy measurement). Figure (b) shows a DAG-model of observed elements $O_1 = \{X_1\}$ and $O_2 = \{X_4, X_5\}$.



For a proof and a condition for equality see Lemma 3 below. In the case of discrete variables, Shannon entropy $H(X_i)$ can be seen as mutual information of $X_i$ with a copy of itself: $H(X_i) = I(X_i : X_i)$. Therefore we can always choose $p(Y \mid X_{[n]})$ such that $Y = X_{[n]}$ and the decomposition of entropies in (2) is recovered. We are interested in decompositions as in (2) and (3), since their violation allows us to exclude possible DAG structures.
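The two decompositions can be verified on a small example. The sketch below assumes a binary chain $X_1 \to X_2 \to X_3$ with flip probability 0.1 and takes $Y$ to be a noisy reading of $X_1 \oplus X_3$; these are illustrative assumptions, and the helper names are hypothetical. It evaluates both sides of the entropy decomposition (2) and of the inequality (3).

```python
import math
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the coordinates idx of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information I(A : B | C) in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def flip(bit, eps):
    """Pmf of a noisy copy of bit: flipped with probability eps."""
    return {bit: 1 - eps, 1 - bit: eps}

# Joint pmf over (x1, x2, x3, y) for the chain X1 -> X2 -> X3,
# with Y a noisy reading of X1 XOR X3 (assumed for illustration).
p = defaultdict(float)
for x1 in (0, 1):
    for x2, p2 in flip(x1, 0.1).items():
        for x3, p3 in flip(x2, 0.1).items():
            for y, py in flip(x1 ^ x3, 0.1).items():
                p[(x1, x2, x3, y)] += 0.5 * p2 * p3 * py

parents = {0: (), 1: (0,), 2: (1,)}  # pa_i as coordinate indices
X_ALL, Y = (0, 1, 2), (3,)

# Equation (2), using H(A | B) = H(A, B) - H(B).
joint_entropy = entropy(p, X_ALL)
sum_conditional = sum(entropy(p, (i,) + pa) - entropy(p, pa) for i, pa in parents.items())

# Inequality (3).
lhs = mi(p, Y, X_ALL)
rhs = sum(mi(p, Y, (i,), pa) for i, pa in parents.items())

print("H(X_1..X_3) =", joint_entropy, " sum of conditional entropies =", sum_conditional)
print("I(Y : X_[3]) =", lhs, " sum of conditional informations =", rhs)
```

With this choice of $Y$ the inequality is strict, because conditioning on $Y$ makes $X_1$ and $X_3$ dependent given $X_2$.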

However, note that the above relations are not yet very useful, since they require, through the 
assumption of the local Markov condition, that we have observed all relevant variables of a system. 
Before we relax this assumption in the next section we introduce mutual information measures on 
general observations. 

Definition 1 (measure of mutual information).

Given a finite set of elements $\mathcal{O}$, a measure of mutual information on $\mathcal{O}$ is a three-argument function on the power set

$$I : 2^{\mathcal{O}} \times 2^{\mathcal{O}} \times 2^{\mathcal{O}} \to \mathbb{R}, \qquad (A, B, C) \mapsto I(A : B \mid C)$$

such that for disjoint sets $A, B, C, D \subseteq \mathcal{O}$ it holds:

$$\begin{aligned}
I(A : \emptyset) &= 0 && \text{(normalization)} \\
I(A : B \mid C) &\ge 0 && \text{(non-negativity)} \\
I(A : B \mid C) &= I(B : A \mid C) && \text{(symmetry)} \\
I(A : (B \cup C) \mid D) &= I(A : B \mid C \cup D) + I(A : C \mid D) && \text{(chain rule).}
\end{aligned}$$

We say $A$ is independent of $B$ given $C$ and write $(A \perp\!\!\!\perp B \mid C)$ iff $I(A : B \mid C) = 0$. Further, we will generally omit the empty set as a third argument and substitute the union by a comma, hence we write $I(A : B)$ instead of $I(A : B \mid \emptyset)$ and $I(A : B, C)$ instead of $I(A : B \cup C)$.
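As a sanity check of Definition 1 in the stochastic case, the following sketch evaluates the four properties for Shannon mutual information on a randomly drawn joint distribution of four binary variables. The random distribution, the seed and the helper names are assumptions made for illustration only.

```python
import itertools
import math
import random
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the coordinates idx of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information I(A : B | C) in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

# A random joint pmf on four binary variables.
random.seed(0)
outcomes = list(itertools.product((0, 1), repeat=4))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}

A, B, C, D = (0,), (1,), (2,), (3,)
normalization = mi(p, A, ())                       # I(A : empty set)
non_negativity = mi(p, A, B, C)                    # I(A : B | C)
symmetry_gap = mi(p, A, B, C) - mi(p, B, A, C)
chain_gap = mi(p, A, B + C, D) - (mi(p, A, B, C + D) + mi(p, A, C, D))

print(normalization, non_negativity, symmetry_gap, chain_gap)
```

Up to floating-point error, the normalization term and both gaps vanish and the conditional information is non-negative.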

Of course, mutual information of discrete as well as of continuous random variables is included in the above definition. Further, in Section 4.2 we will discuss a recently developed theory of causal inference [4] based on algorithmic mutual information of binary strings¹. We now state two properties of mutual information that we need later on.

¹ Mutual information of composed quantum systems satisfies the definition as well, because it can be defined in formal analogy to classical information theory if Shannon entropy is replaced by von Neumann entropy of a quantum state. The properties of mutual information stated above have been used to single out quantum physics from a whole class of no-signaling theories [10].

Lemma 2 (properties of mutual information).

Let $I$ be a measure of mutual information on a set of elements $\mathcal{O}$. Then

(i) (data processing inequality) For three disjoint sets $A, B, C \subseteq \mathcal{O}$

$$I(A : C \mid B) = 0 \implies I(A : B) \ge I(A : C).$$

(ii) (increase through conditioning on independent sets)
For three disjoint sets $A, B, C \subseteq \mathcal{O}$

$$I(A : C \mid B) = 0 \implies I(Y : A \mid B) \le I(Y : A \mid B, C), \tag{4}$$

where $Y$ is an arbitrary set $Y \subseteq \mathcal{O}$ disjoint from the rest. Further, the difference is given by $I(A : C \mid B, Y)$.

Proof: (i) Using the chain rule two times

$$\begin{aligned}
I(A : B) &= I(A : B) + I(A : C \mid B) = I(A : B, C) \\
&= I(A : C) + I(A : B \mid C) \ge I(A : C),
\end{aligned}$$

where the last inequality follows from non-negativity of $I$. To prove (ii) we again use the chain rule

$$\begin{aligned}
I(Y : A \mid B) - I(Y : A \mid B, C) &= I(Y : A \mid B) - I(Y, C : A \mid B) + I(A : C \mid B) \\
&= -I(A : C \mid B, Y) \le 0.
\end{aligned}$$

□
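The chain-rule manipulation in part (ii) yields, without any independence assumption, the identity $I(Y : A \mid B, C) - I(Y : A \mid B) = I(A : C \mid B, Y) - I(A : C \mid B)$, which reduces to the statement of the lemma when $I(A : C \mid B) = 0$. The sketch below checks this identity on a random distribution and the data processing inequality (i) on a binary Markov chain; the distributions and helper names are illustrative assumptions.

```python
import itertools
import math
import random
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the coordinates idx of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information I(A : B | C) in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

A, B, C, Y = (0,), (1,), (2,), (3,)

# Random pmf over (a, b, c, y): the identity behind part (ii) holds in general.
random.seed(1)
outcomes = list(itertools.product((0, 1), repeat=4))
weights = [random.random() for _ in outcomes]
rand = {o: w / sum(weights) for o, w in zip(outcomes, weights)}
identity_gap = (mi(rand, Y, A, B + C) - mi(rand, Y, A, B)) - (
    mi(rand, A, C, B + Y) - mi(rand, A, C, B))

# Markov chain A -> B -> C with flip probability 0.2, so that I(A : C | B) = 0.
chain = defaultdict(float)
for a, b, c in itertools.product((0, 1), repeat=3):
    p_b = 0.8 if b == a else 0.2
    p_c = 0.8 if c == b else 0.2
    chain[(a, b, c)] += 0.5 * p_b * p_c

print("identity gap:", identity_gap)
print("I(A:C|B) =", mi(chain, A, C, B), " I(A:B) =", mi(chain, A, B), " I(A:C) =", mi(chain, A, C))
```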

As in the stochastic setting, we can connect a DAG to the conditional independence relation that is induced by mutual information: we say that a DAG on a given set of observations fulfills the local Markov condition if every node is independent of its non-descendants given its parents. Furthermore, we show in Appendix A that the induced independence relations are sufficiently nice, in the sense that they satisfy the semi-graphoid axioms [11]. This is useful because it implies that a DAG that fulfills the local Markov condition is an efficient partial representation of the conditional independence structure. Namely, conditional independence relations can be read off the graph with the help of a criterion called d-separation [1] (see Appendix A for details).

We conclude with a general formulation of the decomposition of mutual information that we 
already described in the probabilistic case. 

Lemma 3 (decomposition of mutual information).

Let $I$ be a measure of mutual information on elements $O_{[n]} = \{O_1, \dots, O_n\}$ and $Y$. Further let $G$ be a DAG with node set $O_{[n]}$ that fulfills the local Markov condition. Then

$$I(Y : O_{[n]}) \ge \sum_{i=1}^{n} I(Y : O_i \mid O_{\mathrm{pa}_i}) \tag{5}$$

with equality if conditioning on $Y$ preserves the independences of the local Markov condition, that is, for all $i$

$$O_i \perp\!\!\!\perp O_{\mathrm{nd}_i} \mid (O_{\mathrm{pa}_i}, Y). \tag{6}$$

Proof: Assume the $O_i$ are ordered topologically with respect to $G$. The proof is by induction on $n$. The lemma is trivially true if $n = 1$, with equality. Assume that it holds for $k - 1 < n$. It is easy to see that the graph $G_k$ with nodes $O_{[k]}$ that is obtained from $G$ by deleting all but the first $k$ nodes fulfills the local Markov condition with respect to $O_{[k]}$. By the chain rule

$$I(Y : O_{[k]}) = I(Y : O_{[k-1]}) + I(Y : O_k \mid O_{[k-1]})$$

and we are left to show that $I(Y : O_k \mid O_{[k-1]}) \ge I(Y : O_k \mid O_{\mathrm{pa}_k})$. Since the local Markov condition holds, we have $O_k \perp\!\!\!\perp O_{[k-1] \setminus \mathrm{pa}_k} \mid O_{\mathrm{pa}_k}$ and the inequality follows by applying (4). Further, by property (ii) of the previous Lemma, equality holds if for every $k$: $O_k \perp\!\!\!\perp O_{[k-1] \setminus \mathrm{pa}_k} \mid (O_{\mathrm{pa}_k}, Y)$, which is implied by (6). □

In the next section we derive a similar inequality in the case in which only the mutual information of $Y$ with a subset of the nodes $O_{[n]}$ is known.
3 Partial information about a system



We have shown that the information about elements of a system described by a DAG decomposes 
if the graph fulfills the local Markov condition. In this section we derive a similar decomposition 
in cases where not all elements of a system have been observed. This decomposition will of course 
depend on specific properties of G and, in turn, enable us to exclude certain DAGs as models of 
the total system whenever we observe a violation of such a decomposition. 

More precisely, we are interested in properties of the class of DAG-models of a set of observations that we define as follows (see Figure 2(b)).

Definition 4 (DAG-model of observations).

An observation of elements $O_{[n]} = \{O_1, \dots, O_n\}$ with respect to a reference object $Y$ and mutual information measure $I$ is given by the values of $I(Y : O_S)$ for every subset $S \subseteq [n]$.
A DAG $G$ with nodes $X$ together with a measure of mutual information $I_G$ on $X$ is a DAG-model of an observation if the following holds:

(i) each observation $O_i$ is a subset of the nodes $X$ of $G$.

(ii) $G$ fulfills the local Markov condition with respect to $I_G$.

(iii) $I_G$ is an extension of $I$, that is $I_G(Y : O_S) = I(Y : O_S)$ for all $S \subseteq [n]$.

(iv) $Y$ is a leaf node (no descendants) of $G$.

The first three conditions state that, given the causal Markov condition, $G$ is a valid hypothesis on the causal relations among components of some larger system including the $O_{[n]}$ that is consistent with the observed mutual information values. Condition (iv) is merely a technical condition due to the special role of $Y$ as an observation of the $O_{[n]}$ external to the system.
As an example, if the $O_i$ and $Y$ are random variables with joint distribution $p(O_{[n]}, Y)$, a DAG-model $G$ with nodes $X$ is given by the graph structure of a Bayesian net with joint distribution $p(X)$, such that the marginal on $O_{[n]}$ and $Y$ equals $p(O_{[n]}, Y)$. Moreover, if $Y$ is a copy of $O_{[n]}$ then an observation in our sense is given by the values of the Shannon entropy $H(O_S)$ for every subset $S \subseteq [n]$.

The general question posed in this paper can then be formulated as follows: What can be learned 

from an observation given by the values I(Y : Os) about the class of DAG-models? 

As a first step we present a property of mutual information about independent elements. 

Lemma 5 (submodularity of I).

If the $O_i$ are mutually independent, that is $I(O_i : O_{[n] \setminus i}) = 0$ for all $i$, then the function $[n] \supseteq S \mapsto -I(Y : O_S)$ is submodular, that is, for two sets $S, T \subseteq [n]$

$$I(Y : O_S) + I(Y : O_T) \le I(Y : O_{S \cup T}) + I(Y : O_{S \cap T}).$$

Proof: For two subsets $S, T \subseteq [n]$ write $S' = S \setminus (S \cap T)$ and $T' = T \setminus (S \cap T)$. Using the chain rule we have

$$\begin{aligned}
I(Y : O_{S \cup T}) + I(Y : O_{S \cap T}) &= I(Y : O_S) + I(Y : O_{T'} \mid O_S) + I(Y : O_{S \cap T}) \\
&\ge I(Y : O_S) + I(Y : O_{T'} \mid O_{S \cap T}) + I(Y : O_{S \cap T}) \\
&= I(Y : O_S) + I(Y : O_T),
\end{aligned}$$

where the inequality follows from property (4) of mutual information. □

Hence, a violation of submodularity allows one to reject mutual independence among the $O_i$ and therefore to exclude the DAG that does not have any edges from the class of possible DAG-models (the local Markov condition would imply mutual independence).

Figure 3: (a) shows four subsets $O_1, \dots, O_4$ of independent elements $X_j$ 'observed by' $Y$. Note that the intersection of three sets $O_i$ is empty, hence $d_i \le 2$ for all $i = 1, \dots, 4$ in Proposition 6 and therefore $I(Y : O_{[4]}) \ge \frac{1}{2} \sum_{i=1}^{4} I(Y : O_i)$. (b) shows a DAG-model in gray. The observed elements $O_1, \dots, O_4$ are subsets of its nodes. One can check that the DAG does not imply any conditional independencies among the $O_i$ (e.g. with the help of the d-separation criterion, see Appendix A). Nevertheless, there is no common ancestor of all four observations ($\bigcap_{i=1}^{4} \mathrm{an}(O_i) = \emptyset$). Since $Y$ only depends on the $O_i$, inequality (10) of Theorem 7 implies $I(Y : O_{[4]}) \ge \frac{1}{3} \sum_{i=1}^{4} I(Y : O_i)$.

We now broaden the applicability of the above Lemma based on a result for submodular functions from [12]: We assume that there are unknown objects $X = \{X_1, \dots, X_r\}$ which are mutually independent and that the observed elements $O_i \subseteq X$ are subsets of them (see Figure 3(a)). In contrast to the previous lemma, it is no longer required that the $O_i$ are mutually independent themselves. It turns out that the way the information about the $O_i$ decomposes allows for the inference of intersections among the sets $O_i$, namely

Proposition 6 (decomposition of information about sets of independent elements).
Let $X = \{X_1, \dots, X_r\}$ be mutually independent objects, that is $I(X_j : X_{[r] \setminus j}) = 0$ for all $j$. Let $O_{[n]} = \{O_1, \dots, O_n\}$, where each $O_i \subseteq X$ is a non-empty subset of $X$. For every $i \in [n]$ let $d_i$ be maximal such that $O_i$ has non-empty intersection with $d_i - 1$ sets out of $O_{[n]}$ distinct from $O_i$. Then the information about the $O_{[n]}$ can be bounded from below by

$$I(Y : O_{[n]}) \ge \sum_{i=1}^{n} \frac{1}{d_i} I(Y : O_i). \tag{7}$$

For an illustration see Figure 3(a). Even though the proposition is actually a corollary of the following theorem, its proof is given in Appendix B since it is, unlike the theorem, independent of graph-theoretic notions.

As a trivial example consider the case where $O_1 = O_2 = O \subseteq X$ are identical subsets. Then $d_1 = d_2 = 2$ and

$$I(Y : O) = \tfrac{1}{2} I(Y : O_1) + \tfrac{1}{2} I(Y : O_2),$$

hence equality holds in (7). In general, if there is an element in $O_i$ that is also in $k - 1$ different sets $O_j$, then $d_i \ge k$ and we account for this redundancy by dividing the single information $I(Y : O_i)$ by at least $k$.
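The structural parameters $d_i$ and the bound (7) can be evaluated by brute force on a small case. The sketch below assumes three independent fair bits, the overlapping observations $O_1 = \{X_1, X_2\}$, $O_2 = \{X_2, X_3\}$, $O_3 = \{X_1, X_3\}$, and $Y$ a noisy majority vote; all of these, as well as the function names, are illustrative assumptions rather than part of the original text.

```python
import itertools
import math
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the coordinates idx of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information I(A : B | C) in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def d_values(sets):
    """d_i: size of the largest subfamily containing sets[i] whose joint intersection is non-empty."""
    n = len(sets)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        best = 1  # each set is non-empty, so it always intersects itself
        for k in range(1, n):
            for combo in itertools.combinations(others, k):
                if set(sets[i]).intersection(*(sets[j] for j in combo)):
                    best = max(best, k + 1)
        result.append(best)
    return result

# Joint pmf over (x1, x2, x3, y): independent fair bits, Y = majority vote flipped w.p. 0.1.
p = defaultdict(float)
for x in itertools.product((0, 1), repeat=3):
    majority = int(sum(x) >= 2)
    p[x + (majority,)] += 0.9 / 8
    p[x + (1 - majority,)] += 0.1 / 8

observations = [(0, 1), (1, 2), (0, 2)]  # coordinate indices of O_1, O_2, O_3
Y = (3,)
d = d_values(observations)
lhs = mi(p, Y, (0, 1, 2))
rhs = sum(mi(p, Y, o) / d_i for o, d_i in zip(observations, d))
print("d =", d, " I(Y : O_[3]) =", lhs, " bound (7) =", rhs)
```

Every pair of observations overlaps while the triple intersection is empty, so the sketch should report $d_i = 2$ for all $i$, and the bound halves each single information.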

Independent elements can always be modeled as root nodes of a DAG. The following theorem, which is our main result, generalizes the proposition by connecting the information about observations $O_i$ to the intersection structure of associated ancestral sets. For a given DAG $G$, a set of nodes $A$ is called ancestral if for every edge $v \to w$ in $G$ such that $w$ is in $A$, also $v$ is in $A$. Further, for a subset of nodes $S$, we denote by $\mathrm{an}(S)$ the smallest ancestral set that contains $S$. Elements of $\mathrm{an}(S)$ will be called ancestors of $S$.

Theorem 7 (decomposition of ancestral information).

Let $G$ be a DAG-model of an observation of elements $O_{[n]} = \{O_1, \dots, O_n\}$. For every $i$ let $d_i$ be the maximal number such that the intersection of $\mathrm{an}(O_i)$ with $d_i - 1$ distinct sets $\mathrm{an}(O_{i_1}), \dots, \mathrm{an}(O_{i_{d_i - 1}})$ is non-empty. Then the information about all ancestors of $O_{[n]}$ can be bounded from below by

$$I(Y : \mathrm{an}(O_{[n]})) \ge \sum_{i=1}^{n} \frac{1}{d_i} I(Y : \mathrm{an}(O_i)) \ge \sum_{i=1}^{n} \frac{1}{d_i} I(Y : O_i). \tag{8}$$

Furthermore, if $Y$ only depends on the whole system $X$ through the $O_{[n]}$, that is

$$Y \perp\!\!\!\perp X \setminus (O_{[n]} \cup \{Y\}) \mid O_{[n]}, \tag{9}$$

we obtain an inequality containing only known values of mutual information

$$I(Y : O_{[n]}) \ge \sum_{i=1}^{n} \frac{1}{d_i} I(Y : O_i). \tag{10}$$
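The parameters $d_i$ of the theorem depend only on the graph. The following sketch computes ancestral sets and the $d_i$ by brute force for a DAG given as a parent map; the representation, the function names and the two example graphs are assumptions chosen for illustration.

```python
import itertools

def ancestral_set(parents, nodes):
    """Smallest ancestral set an(S): the nodes in S together with all their ancestors."""
    closure, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in closure:
            closure.add(v)
            stack.extend(parents.get(v, ()))
    return closure

def d_values(parents, observations):
    """d_i: largest number of observations, including O_i, whose ancestral sets share a node."""
    anc = [ancestral_set(parents, o) for o in observations]
    n = len(anc)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        best = 1
        for k in range(1, n):
            for combo in itertools.combinations(others, k):
                if anc[i].intersection(*(anc[j] for j in combo)):
                    best = max(best, k + 1)
        result.append(best)
    return result

# Three observed nodes, each pair sharing one unobserved root (hypothetical example graph).
pairwise = {"X1": ("U12", "U13"), "X2": ("U12", "U23"), "X3": ("U13", "U23")}
# The same graph with an additional root U that is a parent of every observed node.
with_root = {v: pa + ("U",) for v, pa in pairwise.items()}

observations = [{"X1"}, {"X2"}, {"X3"}]
print(ancestral_set(pairwise, {"X1"}))
print(d_values(pairwise, observations), d_values(with_root, observations))
```

In the first graph no node is an ancestor of all three observations, so inequality (10) applies with $d_i = 2$; adding the common root raises every $d_i$ to 3 and weakens the bound accordingly.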

The proof is given in Appendix C and an example is illustrated in Figure 3(b). If all quantities except the structural parameters $d_i$ are known, inequality (10) can be used to obtain information about the intersection structure among the $O_i$ that is encoded in the $d_i$, provided that the independence assumption (9) holds. Even if (9) does not hold but information on an upper bound of $I(Y : \mathrm{an}(O_{[n]}))$ is available (e.g. in terms of the entropy of $Y$), information about the intersection structure may be obtained from (8). The following corollary additionally provides a bound on the minimum information about ancestral sets.

Corollary 8 (inference of common ancestors, local version).

Given an observation of elements $O_{[n]} = \{O_1, \dots, O_n\}$, assume that for natural numbers $c = (c_1, \dots, c_n)$ with $1 \le c_i \le n - 1$ we observe

$$\epsilon_c := \sum_{i=1}^{n} \frac{1}{c_i} I(Y : O_i) - I(Y : \mathrm{an}(O_{[n]})) > 0. \tag{11}$$

Let $G$ be an arbitrary DAG-model of the observation. For every $O_i$, let $A_{c_i + 1}$ be the set of common ancestors in $G$ of $O_i$ and at least $c_i$ elements of $O_{[n]}$ different from $O_i$. Then the joint information about all common ancestors can be bounded from below by

$$I\Big(Y : \bigcup_{i=1}^{n} A_{c_i + 1}\Big) \ge \Big(\sum_{i=1}^{n} \frac{1}{c_i} - 1\Big)^{-1} \epsilon_c > 0.$$

In particular, for an index $i \in [n]$ we must have $A_{c_i + 1} \neq \emptyset$, hence there exists a common ancestor of $O_i$ and at least $c_i$ elements of $O_{[n]}$ different from $O_i$.

The proof is given in Appendix D. Theorem 7 and its corollary are our most general results, but for ease of interpretation we illustrate them in the next section only in the special case in which all $c_i$ are equal (Cor. 9) to obtain a lower bound on the information about all common ancestors of at least $c + 1$ elements $O_i$.

To conclude this section, we ask what is the maximum amount of information that one can expect to obtain about the intersection structure of ancestral sets of a DAG-model of an observation. The main requirement for a DAG-model $G$ is that it fulfills the local Markov condition with respect to some larger set $X$ of elements. This will remain true if we add nodes and arbitrary edges in a way that $G$ remains acyclic. Therefore, if $G$ contains a common ancestor of $c$ elements, we can always construct a DAG-model $G'$ that contains a common ancestor of more than $c$ elements (e.g. the DAG-model on the right hand side of Fig. 1 can be transformed into the one on the left hand side). We conclude that without adding minimality requirements for the DAG-models (such as the causal faithfulness assumption [5]) only assertions on ancestors of a minimal number of nodes can be made.
4 Structural implications of redundancy and synergy



The results of the last section can be related to the notions of redundancy and synergy. In the context of neuronal information processing, it has been proposed [13] to capture the redundancy and synergy of elements $O_{[n]} = \{O_1, \dots, O_n\}$ with respect to another element $Y$ using the function

$$r(Y) := \sum_{i=1}^{n} I(Y : O_i) - I(Y : O_{[n]}), \tag{12}$$

where $I$ is a measure of mutual information. Thus $r$ relates information that $Y$ has about the single elements to information about the whole set.

If the sum of the information about the single $O_i$ is larger than the information about the whole set ($r(Y) > 0$), the $O_{[n]}$ are said to be redundant with respect to $Y$. This may be the case if $Y$ 'contains' information that is shared by multiple $O_i$. In general, if the $O_i$ do not share any information, that is, if they are mutually independent, then they cannot be redundant with respect to any $Y$ (this follows from Lemma 5).

On the other hand, if the information of $Y$ about the whole set of elements is larger than about its single elements ($r(Y) < 0$), the $O_{[n]}$ are called synergistic with respect to $Y$. This may for example be the case if $Y$ is generated through a function $Y = f(O_1, \dots, O_n)$ and the function value contains little information about each argument (as is the case for the parity function, see below). If, instead, $Y$ is a copy of the $O_{[n]}$, then $r(Y) \ge 0$ and thus the $O_{[n]}$ are not synergistic with respect to $Y$.

To connect our results to the introduced notion of redundancy and synergy, we introduce the following version of $r$ parametrized by a parameter $c \in \{1, \dots, n\}$

$$r_c(Y) := \frac{1}{c} \sum_{i=1}^{n} I(Y : O_i) - I(Y : O_{[n]}). \tag{13}$$

Intuitively, if $r_c(Y) > 0$ for large $c$, then the $O_i$ are highly redundant with respect to $Y$. Corollary 8 of the last section implies that high redundancy implies common ancestors of many $O_i$.

Corollary 9 (redundancy explained structurally).

Let an observation of elements $O_{[n]} = \{O_1, \dots, O_n\}$ be given by the values of $I(Y : O_S)$ for any subset $S \subseteq [n]$. If $r_c(Y) > 0$, then in any DAG-model of the observation in which $Y$ only depends on $X$ through $O_{[n]}$² there exists a common ancestor of at least $c + 1$ elements of $O_{[n]}$.

In the following two subsections we discuss this result in more detail for the cases in which the 
observed elements are discrete random variables and binary strings. 
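The quantity $r_c(Y)$ of (13) is straightforward to evaluate for discrete distributions. The sketch below computes it for two toy cases that are assumptions made for illustration: three exact copies of a fair bit $Y$ (redundancy) and two independent fair bits with $Y$ their XOR (synergy). The function names are hypothetical.

```python
import math
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the coordinates idx of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information I(A : B | C) in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def redundancy(p, y, observed, c=1):
    """r_c(Y) = (1/c) * sum_i I(Y : O_i) - I(Y : O_[n]) for single-coordinate observations."""
    singles = sum(mi(p, y, (o,)) for o in observed)
    return singles / c - mi(p, y, tuple(observed))

# Three exact copies of a fair bit Y; coordinates are (o1, o2, o3, y).
copies = {(b, b, b, b): 0.5 for b in (0, 1)}
# Two independent fair bits and their parity; coordinates are (o1, o2, y).
parity = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

print("copies: r_2 =", redundancy(copies, (3,), (0, 1, 2), c=2))
print("parity: r_1 =", redundancy(parity, (2,), (0, 1), c=1))
```

For the copies $r_2(Y) > 0$, so Corollary 9 would require a common ancestor of all three observations in any DAG-model satisfying its independence assumption; for the parity case $r_1(Y) < 0$ and no such conclusion follows.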

4.1 Common ancestors of discrete random variables 

Let $X_{[n]} = \{X_1, \dots, X_n\}$ and $Y$ be discrete random variables with joint distribution $p(X_{[n]}, Y)$ and let $I$ denote the usual measure of mutual information given by the Kullback-Leibler divergence of $p$ from its factorized distribution [7]. If $Y = X_{[n]}$ is a copy of the $X_{[n]}$ then $I(Y : X_{[n]}) = H(X_{[n]})$, where $H$ denotes the Shannon entropy. In this case the redundancy $r_1(X_{[n]})$ is equal to the multi-information [14] of the $X_{[n]}$. Moreover, $r_c$ gives rise to a parametrized version of multi-information

$$I_c(X_1, \dots, X_n) := \sum_{i=1}^{n} \frac{1}{c} H(X_i) - H(X_{[n]}),$$

and from Corollary 8 we obtain

Theorem 10 (lower bound on entropy of common ancestors).

Let $X_{[n]}$ be jointly distributed discrete random variables. If $I_c(X_{[n]}) > 0$, then, in any Bayesian net containing the $X_{[n]}$, there exists a common ancestor of strictly more than $c$ variables out of the $X_{[n]}$. Moreover, the entropy of the set $A_{c+1}$ of all common ancestors of more than $c$ variables is lower bounded by

$$H(A_{c+1}) \ge \frac{c}{n - c} I_c(X_{[n]}).$$

² We formulate the independence assumption as $Y \perp\!\!\!\perp \tilde{X} \mid O_{[n]}$, where $\tilde{X}$ denotes all nodes of the DAG-model different from the nodes in $O_{[n]}$ and $Y$. Note that this assumption does not hold in the original context in which $r$ has been introduced. There, $Y$ is the observation of a stimulus that is presented to some neuronal system and the $O_i$ represent the responses of (areas of) neurons to this stimulus.

We continue with some remarks to illustrate the theorem:

(a) Setting $c = 1$, the theorem states that, up to a factor $1/(n - 1)$, the multi-information $I_1$ is a lower bound on the entropy of common ancestors of at least two variables. In particular, if $I_1(X_{[n]}) > 0$, any Bayesian net containing the $X_{[n]}$ must have at least one edge.

(b) Conversely, the entropy of common ancestors of all the elements $X_1, \dots, X_n$ is lower bounded by $(n - 1) I_{n-1}(X_{[n]})$. This bound is not trivial whenever $I_{n-1}(X_{[n]}) > 0$, which is for example the case if the $X_i$ are only slightly disturbed copies of some not necessarily observed random variable (see example below).

(c) We emphasize that the inferred common ancestors can be among the elements $X_i$ themselves. Unobserved common ancestors can only be inferred by postulating assumptions on the causal influences among the $X_i$. If, for example, all the $X_i$ were measured simultaneously, a direct causal influence among the $X_i$ can be excluded and any dependence or redundancy has to be attributed to unobserved common ancestors.

(d) Finally, note that $I_c > 0$ is only a sufficient, but not a necessary condition for the existence of common ancestors. However, the information provided by $I_c$ is used in the theorem in an optimal way. By this we mean that we can construct distributions $p(X_{[n]})$ such that $I_c(X_{[n]}) = 0$ for a given $c$ and no common ancestors of $c + 1$ nodes have to exist.
We conclude this section with two examples: 

Example (three variables): Let $X_1$, $X_2$ and $X_3$ be three binary variables, each with maximal entropy $H(X_i) = \log 2$. Then $I_2(X_1, X_2, X_3) > 0$ iff the joint entropy $H(X_1, X_2, X_3)$ is strictly less than $\frac{3}{2} \log 2$. In this case, there must exist a common ancestor of all three variables in any Bayesian net that contains them. In particular, any Bayesian net corresponding to the DAG on the right hand side of Figure 1 can be excluded as a model.
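The criterion of this example can be evaluated directly. The sketch below measures entropy in bits, so that $\log 2$ corresponds to 1 bit and the threshold to 1.5 bits, and compares three identical fair bits with three independent fair bits. Both distributions and the function names are illustrative assumptions.

```python
import itertools
import math
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the coordinates idx of a pmf {outcome tuple: prob}."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def multi_information(p, n, c):
    """I_c(X_1..X_n) = (1/c) * sum_i H(X_i) - H(X_[n])."""
    singles = sum(entropy(p, (i,)) for i in range(n))
    return singles / c - entropy(p, tuple(range(n)))

def ancestor_entropy_bound(p, n, c):
    """Lower bound c / (n - c) * I_c of Theorem 10 (informative only if I_c > 0)."""
    return c / (n - c) * multi_information(p, n, c)

identical = {(b, b, b): 0.5 for b in (0, 1)}
independent = {x: 1 / 8 for x in itertools.product((0, 1), repeat=3)}

for name, p in (("identical", identical), ("independent", independent)):
    print(name, "I_2 =", multi_information(p, 3, 2),
          " bound =", ancestor_entropy_bound(p, 3, 2))
```

For identical bits the joint entropy is 1 bit, below the 1.5-bit threshold, so $I_2 > 0$ and the theorem requires a common ancestor of all three variables carrying at least 1 bit. For independent bits $I_2 < 0$ and the theorem makes no statement.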

Example (synchrony and interaction among random variables): Let $X_1 = X_2 = \dots = X_n$ be identical random variables with non-vanishing entropy $h$. Then in particular $I_{n-1}(X_{[n]}) = (n - 1)^{-1} h > 0$ and we can conclude that there has to exist a common ancestor of all $n$ nodes in any Bayesian net that contains them.

In contrast to the synchronized case, let $X_1, X_2, \dots, X_n$ be binary random variables taking values
in $\{-1, 1\}$ and assume that the joint distribution is of pure $n$-interaction³, that is, for some $\beta \neq 0$
it has the form

\[ p_\beta(x_1, \dots, x_n) := \frac{1}{Z} \exp(\beta\, x_1 x_2 \cdots x_n), \]

where $Z$ is a normalization constant. It can be shown that there exists a Bayesian net including
the $X_{[n]}$ in which common ancestors of at most two variables exist. This is illustrated in
Figure 4 for three variables and in the limiting case $\beta = \infty$, in which each $X_i$ is uniformly
distributed and $X_1 = X_2 \cdot X_3$. We found it somewhat surprising that, contrary to synchronization,
higher order interaction among observations does not require common ancestors of many variables.
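
The construction for three variables can be reproduced in a few lines. This is a sketch under the assumptions stated in the caption of Figure 4 (three independent, uniform latent variables with values in $\{-1, 1\}$, each observation being the product of its two parents); the variable names are illustrative.

```python
import itertools
import math

def entropy(pmf):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

# Latent variables U12, U13, U23: independent and uniform on {-1, +1}.
joint = {}
for u12, u13, u23 in itertools.product((-1, 1), repeat=3):
    x = (u12 * u13, u12 * u23, u13 * u23)  # each X_i is the product of its two parents
    joint[x] = joint.get(x, 0.0) + 1 / 8

# Only configurations with x1 * x2 * x3 = +1 occur, each with probability 1/4.
for x, p in sorted(joint.items()):
    print(x, p)

h_joint = entropy(joint)          # equals 2 log 2
h_single = math.log(2)            # each X_i is uniformly distributed
i2 = 3 * h_single / 2 - h_joint   # I_2 = (1/2) sum_i H(X_i) - H(X_1, X_2, X_3)
print(i2)                         # negative, so no common ancestor of all three is implied
```

The negative value of $I_2$ is consistent with the text: the parity distribution is compatible with a net that has only pairwise common ancestors.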



4.2 Common ancestors in string manipulation processes 

In some situations it is not convenient or straightforward to summarize an observation in terms of 
a joint probability distribution of random variables. Consider for example cases in which the data 
comes from repeated observations under varying conditions (e.g. time series). A related situation 

3 This terminology is motivated by the general framework of interaction spaces proposed and investigated by
Darroch et al. [15] and used by Amari [16] within information geometry.

Figure 4: The figure illustrates that higher order interaction among observed random variables
can be explained by a Bayesian net in which only common ancestors of two variables exist. More
precisely, all random variables are assumed to be binary with values in $\{-1, 1\}$ and the unobserved
common ancestors $U_{ij}$ are mutually independent and uniformly distributed. Further, the value
of each observation $X_i$ is obtained as the product of the values of its two ancestors. Then the
resulting marginal distribution $p(X_1, X_2, X_3)$ is of higher order interaction: it is related to the
parity function, $p(X_1 = x_1, X_2 = x_2, X_3 = x_3) = \frac{1}{4}$ if $x_1 x_2 x_3 = 1$ and zero otherwise.

is given if the number of samples is low. Janzing and Schölkopf [3] argue that causal inference in
these situations should still be possible, provided that the observations are sufficiently complex. To
this end, they developed a framework for causal inference from single observations that we now
describe briefly. Assume we have observed two objects A and B in nature (e.g. two carpets) and we
encoded these observations into binary strings $a$ and $b$. If the descriptions of the observations in
terms of the strings $a$ and $b$ are sufficiently complex and sufficiently similar (e.g. the same pattern
on the carpets) one would expect an explanation of this similarity in terms of a mechanism that
relates these two strings in nature (are the carpets produced by the same company?). It is
necessary that the descriptions are sufficiently complex, as an example of [3] illustrates: assume
the two observed strings are equal to the first hundred digits of the binary expansion of $\pi$, hence
they can be generated independently by a simple rule. If this is the case, the similarity of the two
strings would not be considered as strong evidence for the existence of a causal link. To exclude
such cases, Kolmogorov complexity [17] $K(s)$ of a string $s$ has been used as a measure of complexity.
It is defined as the length of the shortest program that prints out $s$ on a universal (prefix-free)
Turing machine. With this definition, strings that can be generated using a simple rule, such as
the constant string $s = 0 \cdots 0$ or the first $n$ digits of the binary expansion of $\pi$, are considered
simple, whereas it can be shown that a random string of length $n$ is complex with high probability.
Kolmogorov complexity can be transformed into a function on sets of strings by choosing a suitable
concatenation function $\langle \cdot, \cdot \rangle$, such that $K(s_1, \dots, s_n) = K(\langle s_1, \langle s_2, \dots, \langle s_{n-1}, s_n \rangle \dots \rangle \rangle)$.
The algorithmic mutual information [5] of two strings $a$ and $b$ is then equal to the sum of the
lengths of the shortest programs that generate each string separately minus the length of the
shortest program that generates the strings $a$ and $b$:

\[ I(a : b) \stackrel{+}{=} K(a) + K(b) - K(a, b), \]

where $\stackrel{+}{=}$ stands for equality up to an additive constant that depends on the choice of the universal
Turing machine. In analogy to Reichenbach's principle of common cause, [3] postulates a causal
relation among $a$ and $b$ whenever $I(a : b)$ is large, which is the case if the complexities of the
strings are large and both strings together can be generated by a much shorter program than the
programs that describe them separately.

In formal analogy to the probabilistic case, algorithmic mutual information can be extended to a
conditional version defined for sets of strings $A, B, C \subseteq \{s_1, \dots, s_n\}$ as

\[ I(A : B \mid C) = K(A \cup C) + K(B \cup C) - K(A \cup B \cup C) - K(C). \]

Intuitively, $I(A : B \mid C)$ is the mutual information between the strings of $A$ and the strings of $B$ if
a shortest program that prints the strings in $C$ has been provided as an additional input. Based
on this notion of conditional mutual information, the causal Markov condition can be formulated
in the algorithmic setting. It can be proven [4] to hold for a directed acyclic graph $G$ on strings
$s_1, \dots, s_n$ if every $s_j$ can be computed by a simple program on a universal Turing machine from
its parents and an additional string $n_j$, such that the $n_j$ are mutually independent. Without going
into the details, we sum up by stating that DAGs on strings can be given a causal interpretation
and it is therefore interesting to infer properties of the class of possible DAGs that represent the
algorithmic conditional independence relations.
In the algorithmic setting, our result can be stated as follows:

Theorem 11 (inference of common ancestors of strings).

Let $O_{[n]} = \{s_1, \dots, s_n\}$ be a set of binary strings. If for a number $c$ ($1 \le c \le n - 1$)

\[ \frac{1}{c} \sum_{i=1}^{n} K(s_i) - K(s_1, \dots, s_n) \stackrel{+}{>} 0, \]

then there must exist a common ancestor of at least $c + 1$ strings out of $O_{[n]}$ in any DAG-model
of the $O_{[n]}$.⁴

Proof: As described, algorithmic mutual information is an information measure in our sense only
up to an additive constant depending on the choice of the universal Turing machine. However,
one can check that in this case, the decomposition of mutual information (Theorem 7) holds up
to an additive constant that depends additionally on the number of strings $n$ and the chosen
parameter $c$. The result on Kolmogorov complexities follows by choosing $Y = (s_1, \dots, s_n)$, since
$K(s_i) \stackrel{+}{=} I(Y : s_i)$. □

Thus, highly redundant strings require a common ancestor in any DAG-model. Since the Kolmogorov
complexity of a string $s$ is uncomputable, we have argued in recent work [5] that it can
be substituted by a measure of complexity in terms of the length of a compressed version of $s$ with
respect to a chosen compression scheme (instead of a universal Turing machine) and the above
result should still hold approximately.
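
To make the compression-based substitute concrete, here is a rough sketch that replaces $K$ by the length of a zlib-compressed string. The choice of zlib, the helper names `c_len` and `approx_ic`, and the use of plain concatenation in place of a proper pairing function are all assumptions made for illustration; they are not taken from the cited work, and the resulting numbers are only a heuristic proxy for the quantity in Theorem 11.

```python
import os
import zlib

def c_len(data: bytes) -> int:
    """Compressed length in bytes; a computable stand-in for K(s) (assumption)."""
    return len(zlib.compress(data, 9))

def approx_ic(strings, c):
    """Proxy for (1/c) * sum_i K(s_i) - K(s_1, ..., s_n) using zlib lengths."""
    return sum(c_len(s) for s in strings) / c - c_len(b"".join(strings))

base = os.urandom(2000)                        # incompressible with high probability
copies = [base, base, base]                    # three identical, complex strings
unrelated = [os.urandom(2000) for _ in range(3)]

print(approx_ic(copies, 2))      # clearly positive: redundancy among all three strings
print(approx_ic(unrelated, 2))   # clearly negative: no redundancy detected
```

With `c = 2`, a positive value would suggest, under these approximations, a common ancestor of all three strings; simple strings such as long runs of zeros compress to almost nothing and therefore never produce a large value.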

4.3 Structural implications from synergy? 

We saw that large redundancy implies common ancestors of many elements and we may wonder
whether structural information can be obtained from synergy in a similar way. This seems not to
be possible, since synergy is related to more fine-grained information (information about the
mechanisms), as the following example shows: Assume the observations $O_{[n]}$ are mutually independent.
Then any DAG is a valid DAG-model, since the local Markov condition will always be satisfied.
We also know that $r_1(Y) \le 0$, but it turns out that the amount of synergy crucially depends on the
way that $Y$ has processed the information of the $O_{[n]}$ (and therefore not on a structural property
among the $O_{[n]}$ themselves). To see this, let the observations $O_i$ be binary random variables which
are mutually independent and distributed uniformly, such that

\[ p(O_{[n]}) = \prod_{i=1}^{n} p(O_i) \quad \text{and} \quad p(O_i = 1) = p(O_i = 0) = 1/2 . \]

Further, let $Y = (O_i \oplus O_j)_{i<j}$ be a function of the observations (addition is modulo 2). Then the
$O_{[n]}$ are highly synergetic with respect to $Y$, that is, $r_1(Y) = -(n-1) \log 2$. On the other hand,
if $Y = O_1 \oplus \cdots \oplus O_n$, then $r_1(Y) = -\log 2$ only.
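
Both values can be confirmed by brute force for small $n$. The sketch below assumes $r_1(Y) = \sum_i I(Y : O_i) - I(Y : O_{[n]})$ for the stochastic mutual information; the helper names (`push`, `mutual_info`, `r1`) and the choice $n = 4$ are illustrative.

```python
import itertools
import math

def entropy(pmf):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def push(joint, f):
    """Distribution of f(outcome) when the outcome is drawn from `joint`."""
    out = {}
    for x, p in joint.items():
        out[f(x)] = out.get(f(x), 0.0) + p
    return out

def mutual_info(joint, f, g):
    """I(f(O) : g(O)) = H(f) + H(g) - H(f, g)."""
    return (entropy(push(joint, f)) + entropy(push(joint, g))
            - entropy(push(joint, lambda x: (f(x), g(x)))))

def r1(joint, y):
    """r_1(Y) = sum_i I(Y : O_i) - I(Y : O_[n]) for Y = y(O_1, ..., O_n)."""
    n = len(next(iter(joint)))
    parts = sum(mutual_info(joint, y, lambda x, i=i: x[i]) for i in range(n))
    return parts - mutual_info(joint, y, lambda x: x)

n = 4
uniform = {x: 2.0 ** -n for x in itertools.product((0, 1), repeat=n)}

pairwise_xor = lambda x: tuple(x[i] ^ x[j] for i in range(n) for j in range(i + 1, n))
total_xor = lambda x: sum(x) % 2

print(r1(uniform, pairwise_xor) / math.log(2))  # approximately -(n - 1)
print(r1(uniform, total_xor) / math.log(2))     # approximately -1
```

The same independent observations thus yield very different amounts of synergy depending only on how $Y$ is computed from them.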

Nevertheless, it is an easy observation that synergy with respect to Y can be related to an increase 

4 Here $\stackrel{+}{>}$ means up to an additive constant dependent only on the choice of a universal Turing machine, on $c$
and on $n$.

of redundancy after conditioning on $Y$. Since $I(\cdot \mid Y)$ is a measure of mutual information as well,
we define a conditioned version of $r_c$ in a canonical way as

\[ r_c(Z \mid Y) = \frac{1}{c} \sum_{i=1}^{n} I(Z : O_i \mid Y) - I(Z : O_{[n]} \mid Y) , \]

with respect to some observation $Z$. If $I$ can be evaluated on non-disjoint subsets, that is, if we
can choose $Z = O_{[n]}$, we have the following

Proposition 12 (synergy from increased redundancy induced by conditioning).

Let $O_{[n]} = \{O_1, \dots, O_n\}$ and $Y$ be arbitrary elements on which a mutual information function $I$
is defined. Then

\[ r_c(Y) = r_c(O_{[n]}) - r_c(O_{[n]} \mid Y), \]

hence if conditioning on $Y$ increases the redundancy of $O_{[n]}$ with respect to itself, then $r_c(Y) < 0$
and the $O_{[n]}$ are synergetic with respect to $Y$.

Proof: Using the chain rule, we derive

\[ r_c(O_{[n]}) - r_c(O_{[n]} \mid Y) = r_c(Y) - r_c(Y \mid O_{[n]}) = r_c(Y) , \]

where the last equality follows because $r_c(Y \mid O_{[n]}) = 0$. □
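
The chain-rule step in the proof can be spelled out as follows. This is a sketch of one way to fill in the computation, using only the chain rule and the definition of $r_c(\cdot \mid \cdot)$ given above; the auxiliary symbols $A$ and $B$ are introduced here for illustration.

```latex
% Applying the chain rule in two different orders gives, for any A, B and Y,
%   I(B : (A, Y)) = I(B : A) + I(B : Y | A) = I(B : Y) + I(B : A | Y),
% and therefore
\[
  I(A : B) - I(A : B \mid Y) = I(Y : B) - I(Y : B \mid A).
\]
% Use this with A = O_{[n]} and B = O_i (respectively B = O_{[n]} for the last term):
\begin{align*}
  r_c(O_{[n]}) - r_c(O_{[n]} \mid Y)
  &= \frac{1}{c}\sum_{i=1}^{n}\bigl[I(O_{[n]} : O_i) - I(O_{[n]} : O_i \mid Y)\bigr]
     - \bigl[I(O_{[n]} : O_{[n]}) - I(O_{[n]} : O_{[n]} \mid Y)\bigr] \\
  &= \frac{1}{c}\sum_{i=1}^{n}\bigl[I(Y : O_i) - I(Y : O_i \mid O_{[n]})\bigr]
     - \bigl[I(Y : O_{[n]}) - I(Y : O_{[n]} \mid O_{[n]})\bigr] \\
  &= r_c(Y) - r_c(Y \mid O_{[n]}).
\end{align*}
```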

Continuing the example of binary random variables above, mutual independence of the $O_{[n]}$ is
equivalent to $r_1(O_{[n]}) = 0$ and therefore, using the proposition, $r_1(Y) = -r_1(O_{[n]} \mid Y)$. Thus, if
$Y = O_1 \oplus \cdots \oplus O_n$,

\[ r_1(Y) = -r_1(O_{[n]} \mid Y) = H(O_{[n]} \mid Y) - \sum_{i=1}^{n} H(O_i \mid Y) = -\log 2, \]

as already noted above.

5 Discussion 

Based on a generalized notion of mutual information, we proved an inequality describing the
decomposition of information about a whole set into the sum of information about its parts. The
decomposition depended on a structural property, namely the existence of common ancestors in
a DAG. We connected the result to the notions of redundancy and synergy and concluded that
large redundancy implies the existence of common ancestors in any DAG-model. Specialized to
the case of discrete random variables, this means that large stochastic dependence in terms of
multi-information needs to be explained through a common ancestor (in a Bayesian net) acting
as a broadcaster of information.

Much work has already been done on the restrictions that are imposed on observations
by graphical models that include latent variables. Pearl [HQS] already investigated constraints
imposed by the special instrumental variable model. Also Darroch et al. [15] and recently Sullivant
et al. [19] looked at linear Gaussian graphical models and determined constraints in terms of
the entries of the covariance matrix describing the data (tetrad constraints). Further, methods of
algebraic statistics were applied (e.g. [20]) to derive constraints that are induced by latent variable
models directly on the level of probabilities. In general this does not seem to be an easy task due
to the large number of variables involved, whereas information theoretic quantities allow for relatively
easy derivations of 'macroscopic' constraints (see also [21]).

Finally, we think that the general methodology of connecting concepts such as synergy and
redundancy of observations to properties of the class of possible DAG-models is interesting, especially
in the light of their causal interpretation.

A Semi-graphoid axioms and d-separation

Consider the conditional independence relation that is induced by an information measure on a
set of objects ($A \perp B \mid C \;:\Leftrightarrow\; I(A : B \mid C) = 0$). Then

Lemma 13 (general independence satisfies semi-graphoid axioms).

The relation of (conditional) independence induced by an independence measure $I$ on elements $O$
satisfies the semi-graphoid axioms: For disjoint subsets $W$, $X$, $Y$ and $Z$ of $O$ it holds

(1) $X \perp Y \mid Z \;\Rightarrow\; Y \perp X \mid Z$ (symmetry)

(2) $X \perp (Y, W) \mid Z \;\Rightarrow\; X \perp Y \mid Z$ and $X \perp W \mid Z$ (decomposition)

(3) $X \perp (Y, W) \mid Z \;\Rightarrow\; X \perp Y \mid (Z, W)$ (weak union)

(4) $X \perp W \mid (Z, Y)$ and $X \perp Y \mid Z \;\Rightarrow\; X \perp (W, Y) \mid Z$ (contraction)

The proof is immediate using non-negativity and the chain rule of mutual information. In the
probabilistic context, the axiomatic approach to conditional independence has been presented by
Dawid [H]. The above Lemma is important, since it implies that a DAG that fulfills the local
Markov condition with respect to a set of objects is an efficient partial⁵ representation of the
conditional independence structure among the observations. Namely, conditional independence
relations can be read off the graph with the help of a criterion called d-separation [1]. This is the
content of the following theorem, but before stating it we recall the definition of d-separation: Two
sets of nodes $A$ and $B$ of a DAG are d-separated given a set $C$ disjoint from $A$ and $B$ if every
undirected path between $A$ and $B$ is blocked by $C$. A path that is described by the ordered tuple
of nodes $(x_1, x_2, \dots, x_r)$ with $x_1 \in A$ and $x_r \in B$ is blocked if at least one of the following is true:

(1) there is an $i$ such that $x_i \in C$ and $x_{i-1} \to x_i \to x_{i+1}$ or $x_{i-1} \leftarrow x_i \leftarrow x_{i+1}$ or
$x_{i-1} \leftarrow x_i \to x_{i+1}$,

(2) there is an $i$ such that $x_i$ and its descendants are not in $C$ and $x_{i-1} \to x_i \leftarrow x_{i+1}$.

Theorem 14 (Equivalence of Markov conditions).

Let $I$ be a measure of mutual information on elements $O_{[n]} = \{O_1, \dots, O_n\}$ and let $G$ be a DAG
with node set $O_{[n]}$. Then the following two properties are equivalent:

(1) (local Markov condition) Every node $O_i$ of $G$ is independent of its non-descendants $O_{nd_i}$
given its parents $O_{pa_i}$,

\[ O_i \perp O_{nd_i} \mid O_{pa_i} . \]

(2) (global Markov condition) For every three disjoint sets of nodes $A$, $B$ and $C$ such that $A$ is
d-separated from $B$ given $C$ in $G$, it holds $A \perp B \mid C$.

Proof: (1) $\Rightarrow$ (2). Since the dependence measure $I$ satisfies the semi-graphoid axioms (Lemma
13) we can apply Theorem 2 in Verma & Pearl (55] which asserts that the DAG is an I-map, or
in other words that d-separation relations represent a subset of the (conditional) independences
that hold for the given objects.

(2) $\Rightarrow$ (1) holds because the non-descendants of a node are d-separated from the node itself by
the parents. □
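
For readers who want to experiment with d-separation, the following sketch tests it on a small DAG. It does not enumerate paths as in the definition above; instead it uses the well-known equivalent criterion based on the moralized graph of the smallest ancestral set containing $A \cup B \cup C$. The representation of the DAG as a child-to-parents dictionary, the function names and the example graph are assumptions made for illustration.

```python
def ancestors(parents, nodes):
    """Smallest ancestral set containing `nodes` (the nodes plus all their ancestors)."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen

def d_separated(parents, a, b, c):
    """True if the disjoint node sets a and b are d-separated given c in the DAG."""
    a, b, c = set(a), set(b), set(c)
    keep = ancestors(parents, a | b | c)
    # Moralize the ancestral subgraph: connect each node to its parents
    # and connect all parents of a common child to each other.
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = list(parents.get(v, ()))
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # Search from a without entering c; reaching b means the sets are connected.
    seen, stack = set(), list(a)
    while stack:
        v = stack.pop()
        if v in seen or v in c:
            continue
        if v in b:
            return False
        seen.add(v)
        stack.extend(nbrs[v])
    return True

# Hypothetical example DAG: U -> X1, U -> X2, X1 -> X3 <- X2 (child -> list of parents).
dag = {"X1": ["U"], "X2": ["U"], "X3": ["X1", "X2"]}
print(d_separated(dag, {"X1"}, {"X2"}, {"U"}))        # True: the common cause is observed
print(d_separated(dag, {"X1"}, {"X2"}, {"U", "X3"}))  # False: conditioning on the collider X3
print(d_separated(dag, {"X1"}, {"X2"}, set()))        # False: the path through U is open
```

By Theorem 14, every `True` answer corresponds to a conditional independence that must hold for any information measure satisfying the local Markov condition on the graph.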



5 In general there may hold additional conditional independence relations among the observations that are not
implied by the local Markov condition together with the semi-graphoid axioms. In fact, it is well known that there
are so-called non-graphical probability distributions whose conditional independence structure cannot be completely
represented by any DAG.






B Proof of Proposition 6



We have shown in Lemma 5 the submodularity of $I(Y : \cdot)$ with respect to independent sets. The
rest of the proof is along the lines of the proof of Corollary I in [TJ]: First, by iteratively applying
the chain rule for mutual information we obtain

\[ I(Y : X_{[r]}) = \sum_{i=0}^{r-1} I(Y : X_{i+1} \mid X_{[i]}) . \tag{14} \]



Without loss of generality we can assume that every $X_i$ is part of at least one set $O_k$ for some $k$.
Let $n_i$ be the total number of subsets $O_k$ containing $X_i$. By definition of $d_k$, for every $k$ with $X_i \in O_k$ it holds
$n_i \le d_k$ and we obtain

\[ \sum_{O_j :\, X_i \in O_j} \frac{1}{d_j} \;\le\; n_i \cdot \max_{O_j :\, X_i \in O_j} \frac{1}{d_j} \;\le\; 1 . \tag{15} \]

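Putting (14) and (15) together we get, after shifting the summation index in (14),

\begin{align*}
I(Y : O_{[n]}) = I(Y : X_{[r]}) &= \sum_{i=1}^{r} I(Y : X_i \mid X_{[i-1]}) \\
&\ge \sum_{i=1}^{r} I(Y : X_i \mid X_{[i-1]}) \Bigl( \sum_{O_j :\, X_i \in O_j} \frac{1}{d_j} \Bigr) \\
&\overset{(a)}{=} \sum_{j=1}^{n} \frac{1}{d_j} \sum_{X_i \in O_j} I(Y : X_i \mid X_{[i-1]}) \\
&\overset{(b)}{\ge} \sum_{j=1}^{n} \frac{1}{d_j} \sum_{X_i \in O_j} I(Y : X_i \mid X_{[i-1]} \cap O_j) \\
&\overset{(c)}{=} \sum_{j=1}^{n} \frac{1}{d_j} I(Y : O_j) ,
\end{align*}

where (a) is obtained by exchanging summations and (b) uses the property of $I$ that conditioning
on independent objects can only increase mutual information (inequality (4) applied to
$X_i \perp (X_{[i-1]} \setminus O_j) \mid (X_{[i-1]} \cap O_j)$). This is the point at which submodularity of $I$ is used, since it is actually
equivalent to (4), as can be seen from the proof of Lemma 5. Finally, (c) is an application of the
chain rule to the elements of each $O_j$ separately.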



C Proof of Theorem 7

By assumption $O_i \subseteq X$ and the DAG $G$ with node set $X$ fulfills the local Markov condition. For
each $O_i$ denote by $\mathrm{an}_G(O_i)$ the smallest ancestral set in $G$ containing $O_i$.

An easy observation that we need in the proof is given by the fact that two ancestral sets $A$ and
$B$ are independent given their intersection:

\[ A \setminus B \;\perp\; B \setminus A \;\mid\; A \cap B . \tag{16} \]

This is implied by d-separation using Theorem 14.
We first prove the inequality

\[ I(Y : \mathrm{an}_G(O_{[n]})) \ge \sum_{i=1}^{n} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i)) . \tag{17} \]

From this the inequalities of the theorem follow directly: the first one holds since $I(Y : \mathrm{an}_G(O_i)) \ge I(Y : O_i)$
using the monotony of $I$ (implied by chain rule and non-negativity). Further, (10) is a direct
consequence of (17) together with the independence assumption (5), since by the chain rule

\[ I(Y : \mathrm{an}_G(O_{[n]})) = I(Y : O_{[n]}) + I(Y : \mathrm{an}_G(O_{[n]}) \setminus O_{[n]} \mid O_{[n]}) = I(Y : O_{[n]}) , \]

where the last equality is a consequence of

The proof of (17) is by induction on the number of elements in $A = \mathrm{an}_G(O_{[n]})$. If $A = \emptyset$ nothing
has to be proven. Assume now (17) holds for $O_{[n]} = \{O_1, \dots, O_n\}$ such that $A = \bigcup_{i=1}^{n} \mathrm{an}_G(O_i)$ is of
cardinality at most $k - 1$. Let $O_{[n]}$ be a set of observations such that $A$ is of cardinality $k$. From
$O_{[n]}$ we construct a new collection $\tilde{O}_{[n]}$ as follows: W.l.o.g. assume $m := d_1 > 0$, in particular
$O_1$ is non-empty and moreover, by definition of $d_1$ and after reordering of the $O_i$, we can assume
that the intersection $V := \bigcap_{i=1}^{m} \mathrm{an}_G(O_i)$ is non-empty. Note that $V$ itself is an ancestral set. We
define $\tilde{O}_i = O_i \setminus V$ for all $1 \le i \le n$ and denote by $\tilde{G}$ the modified graph that is obtained from
$G$ by removing all elements of $V$. Further, denote by $\tilde{I}(A : B \mid C) := I(A : B \mid C, V)$ a modified
measure of mutual information obtained by conditioning on $V$. One checks easily that the graph
$\tilde{G}$ fulfills the local Markov condition with respect to the independence relation induced by $\tilde{I}$ and
is a DAG-model of the elements $\tilde{O}_{[n]}$. Hence, by induction assumption

\[ \tilde{I}(Y : \mathrm{an}_{\tilde{G}}(\tilde{O}_{[n]})) \ge \sum_{i=1}^{n} \frac{1}{\tilde{d}_i} \tilde{I}(Y : \mathrm{an}_{\tilde{G}}(\tilde{O}_i)) , \tag{18} \]

where $\tilde{d}_i$ is defined similarly as $d_i$, but with respect to the elements $\tilde{O}_i$ and $\tilde{G}$. Further, the sum
is over all non-empty $\tilde{O}_i$. By construction of $\tilde{I}$ and $\tilde{O}_{[n]}$, the left hand side of (18) is equal to

\[ \tilde{I}(Y : \mathrm{an}_{\tilde{G}}(\tilde{O}_{[n]})) = I(Y : \mathrm{an}_G(O_{[n]}) \setminus V \mid V) = I(Y : \mathrm{an}_G(O_{[n]})) - I(Y : V) . \tag{19} \]

The right hand side of (18) can be rewritten to



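\begin{align*}
\sum_{i=1}^{n} \frac{1}{\tilde{d}_i} \tilde{I}(Y : \mathrm{an}_{\tilde{G}}(\tilde{O}_i))
&\overset{(a)}{\ge} \sum_{i=1}^{n} \frac{1}{d_i} \tilde{I}(Y : \mathrm{an}_{\tilde{G}}(\tilde{O}_i)) \\
&\overset{(b)}{=} \sum_{i=1}^{m} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i) \setminus V \mid V) + \sum_{i=m+1}^{n} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i) \mid V) \\
&\overset{(c)}{\ge} \sum_{i=1}^{m} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i) \setminus V \mid V) + \sum_{i=m+1}^{n} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i)) ,
\end{align*}

where (a) follows because $d_i \ge \tilde{d}_i$ by definition and (b) follows because $\mathrm{an}_G(O_i) \cap V = \emptyset$ for $i > m$.
Hence by (16) $V$ and $\mathrm{an}_G(O_i)$ are independent and therefore conditioning on $V$ only increases
mutual information as proven in Lemma 2, and inequality (c) follows. We continue by rewriting
the first $m$ summands of the right hand side using the chain rule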



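\begin{align*}
\sum_{i=1}^{m} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i) \setminus V \mid V)
&= \sum_{i=1}^{m} \frac{1}{d_i} \bigl[ I(Y : \mathrm{an}_G(O_i)) - I(Y : V) \bigr] \\
&\ge \Bigl[ \sum_{i=1}^{m} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i)) \Bigr] - I(Y : V) ,
\end{align*}

where the inequality holds because $\sum_{i=1}^{m} \frac{1}{d_i} \le 1$, which has already been used, see (15) in the proof
of Proposition 6. Summarizing, the right hand side of (18) can be bounded from below by

\[ \sum_{i=1}^{n} \frac{1}{\tilde{d}_i} \tilde{I}(Y : \mathrm{an}_{\tilde{G}}(\tilde{O}_i)) \ge \sum_{i=1}^{n} \frac{1}{d_i} I(Y : \mathrm{an}_G(O_i)) - I(Y : V) . \]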



i=i 



Since we have shown in (18) and (19) that the left hand side can be bounded from above by
$I(Y : \mathrm{an}_G(O_{[n]})) - I(Y : V)$, we observe that $I(Y : V)$ cancels and (17) is proven.

D Proof of Corollary 8



Proof: Let $G$ be a DAG-model of the observation of $O_{[n]} = \{O_1, \dots, O_n\}$. We construct a new
DAG $G'$ by removing the objects of $A := \bigcup_{i=1}^{n} A_{c_i+1}$. Since $A$ is an ancestral set, $G'$ fulfills the
local Markov condition with respect to the mutual information measure obtained by conditioning
on $A$. We apply Theorem 7 to $G'$ and the observations $O'_{[n]} = \{O_1 \setminus A, \dots, O_n \setminus A\}$ to get

\[ I(Y : \mathrm{an}_{G'}(O'_{[n]}) \mid A) \ge \sum_{i=1}^{n} \frac{1}{c_i} I(Y : O'_i \mid A) . \tag{20} \]

Using assumption (11) and the chain rule for mutual information we obtain

\begin{align*}
I(Y : A) &= I(Y : \mathrm{an}_G(O_{[n]})) - I(Y : \mathrm{an}_G(O_{[n]}) \setminus A \mid A) \\
&\overset{(a)}{=} I(Y : \mathrm{an}_G(O_{[n]})) - I(Y : \mathrm{an}_{G'}(O'_{[n]}) \mid A) \\
&\overset{(b)}{\le} \sum_{i=1}^{n} \frac{1}{c_i} \bigl[ I(Y : O_i) - I(Y : O'_i \mid A) \bigr] - \epsilon_c \\
&\overset{(c)}{\le} \sum_{i=1}^{n} \frac{1}{c_i} I(Y : A) - \epsilon_c ,
\end{align*}

where in (a) we used the definition of $O'_i$ and for (b) we plugged in inequalities (11) and (20).
Finally, (c) holds because

\begin{align*}
I(Y : O_i) - I(Y : O'_i \mid A) &= I(Y : O_i \cap A \mid O'_i) + I(Y : O'_i) - I(Y : O'_i \mid A) \\
&= I(Y : O_i \cap A \mid O'_i) + I(Y : A) - I(Y : A \mid O'_i) \le I(Y : A) ,
\end{align*}

where the chain rule has been applied multiple times. The corollary now follows by solving for
$I(Y : A)$. □